Conditional Random Field

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions. Other examples where CRFs are used are: labeling or parsing of sequential data for natural language processing or biological sequences, part-of-speech tagging, shallow parsing, named entity recognition, gene finding, peptide critical functional region finding, and object recognition and image segmentation in computer vision.


Description

CRFs are a type of discriminative undirected probabilistic graphical model. Lafferty, McCallum and Pereira define a CRF on observations \boldsymbol{X} and random variables \boldsymbol{Y} as follows:

Let G = (V, E) be a graph such that \boldsymbol{Y} = (\boldsymbol{Y}_v)_{v \in V}, so that \boldsymbol{Y} is indexed by the vertices of G. Then (\boldsymbol{X}, \boldsymbol{Y}) is a conditional random field when each random variable \boldsymbol{Y}_v, conditioned on \boldsymbol{X}, obeys the Markov property with respect to the graph; that is, its probability is dependent only on its neighbours in G: P(\boldsymbol{Y}_v \mid \boldsymbol{X}, \{\boldsymbol{Y}_w : w \neq v\}) = P(\boldsymbol{Y}_v \mid \boldsymbol{X}, \{\boldsymbol{Y}_w : w \sim v\}), where w \sim v means that w and v are neighbors in G.

What this means is that a CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets \boldsymbol{X} and \boldsymbol{Y}, the observed and output variables, respectively; the conditional distribution p(\boldsymbol{Y} \mid \boldsymbol{X}) is then modeled.
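Because each \boldsymbol{Y}_v obeys the Markov property with respect to G, the Hammersley–Clifford theorem (see "See also" below) implies that a strictly positive distribution of this kind factorizes over the cliques of G. A common concrete parameterization, shown here as an illustrative sketch rather than as part of the definition, uses log-linear clique potentials built from feature functions f_k with weights \theta_k:

:p(\boldsymbol{y} \mid \boldsymbol{x}) = \frac{1}{Z(\boldsymbol{x})} \exp\left( \sum_{C \in \mathcal{C}(G)} \sum_k \theta_k f_k(\boldsymbol{y}_C, \boldsymbol{x}) \right), \qquad Z(\boldsymbol{x}) = \sum_{\boldsymbol{y}'} \exp\left( \sum_{C \in \mathcal{C}(G)} \sum_k \theta_k f_k(\boldsymbol{y}'_C, \boldsymbol{x}) \right),

where \mathcal{C}(G) denotes the set of cliques of G and Z(\boldsymbol{x}) is the observation-dependent normalization (partition) function.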


Inference

For general graphs, the problem of exact inference in CRFs is intractable. The inference problem for a CRF is basically the same as for an MRF and the same arguments hold. However, there exist special cases for which exact inference is feasible:

* If the graph is a chain or a tree, message passing algorithms yield exact solutions. The algorithms used in these cases are analogous to the forward-backward and Viterbi algorithm for the case of HMMs (see the sketch after this list).
* If the CRF only contains pair-wise potentials and the energy is submodular, combinatorial min cut/max flow algorithms yield exact solutions.

If exact inference is impossible, several algorithms can be used to obtain approximate solutions. These include:

* Loopy belief propagation
* Alpha expansion
* Mean field inference
* Linear programming relaxations
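For the chain case, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF, assuming the feature contributions have already been collapsed into per-position label scores and label-to-label transition scores (the array names `unary` and `trans` are illustrative, not taken from any particular library):

```python
import numpy as np

def viterbi(unary, trans):
    """Most likely label sequence for a linear-chain CRF.

    unary: (T, K) array; unary[t, j] is the score of label j at position t.
    trans: (K, K) array; trans[i, j] is the score of the transition i -> j.
    Returns the highest-scoring label sequence as a list of length T.
    """
    T, K = unary.shape
    delta = unary[0].copy()            # best score of any path ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # scores[i, j]: best path ending in label i at t-1, extended by i -> j
        scores = delta[:, None] + trans + unary[t][None, :]
        backptr[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0)
    # Trace the best path back from the best final label.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]
```

Replacing max/argmax with a log-sum-exp turns the same recursion into the forward pass used to compute marginals and the partition function.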


Parameter learning

Learning the parameters \theta is usually done by maximum likelihood learning for p(Y_i \mid X_i; \theta). If all nodes have exponential family distributions and all nodes are observed during training, this optimization is convex. It can be solved for example using gradient descent algorithms, or quasi-Newton methods such as the L-BFGS algorithm. On the other hand, if some variables are unobserved, the inference problem has to be solved for these variables. Exact inference is intractable in general graphs, so approximations have to be used.
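For the linear-chain case this objective can be written down directly. The sketch below computes the negative log-likelihood of one gold label sequence, again assuming illustrative `unary` and `trans` score arrays (which in a full implementation would depend linearly on \theta through the feature functions); in practice one hands this objective and its gradient, obtained from forward-backward marginals, to an off-the-shelf optimizer such as L-BFGS:

```python
import numpy as np
from scipy.special import logsumexp

def log_partition(unary, trans):
    """log Z(x) for a linear-chain CRF, computed by the forward recursion."""
    alpha = unary[0].copy()              # log-scores of all length-1 prefixes
    for t in range(1, unary.shape[0]):
        # log-sum-exp over the previous label, for every current label
        alpha = unary[t] + logsumexp(alpha[:, None] + trans, axis=0)
    return logsumexp(alpha)

def neg_log_likelihood(unary, trans, y):
    """-log p(y | x) for one observed label sequence y (label indices)."""
    gold = unary[0, y[0]] + sum(
        trans[y[t - 1], y[t]] + unary[t, y[t]] for t in range(1, len(y))
    )
    return log_partition(unary, trans) - gold
```

Because log Z is convex in \theta when the scores are linear in \theta, minimizing this quantity summed over the training set is the convex problem described above.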


Examples

In sequence modeling, the graph of interest is usually a chain graph. An input sequence of observed variables X represents a sequence of observations and Y represents a hidden (or unknown) state variable that needs to be inferred given the observations. The Y_i are structured to form a chain, with an edge between each Y_{i-1} and Y_i. As well as having a simple interpretation of the Y_i as "labels" for each element in the input sequence, this layout admits efficient algorithms for:

* model ''training'', learning the conditional distributions between the Y_i and feature functions from some corpus of training data.
* ''decoding'', determining the probability of a given label sequence Y given X.
* ''inference'', determining the ''most likely'' label sequence Y given X.

The conditional dependency of each Y_i on X is defined through a fixed set of ''feature functions'' of the form f(i, Y_{i-1}, Y_i, X), which can be thought of as measurements on the input sequence that partially determine the likelihood of each possible value for Y_i. The model assigns each feature a numerical weight and combines them to determine the probability of a certain value for Y_i (a toy sketch of such feature functions is given below).

Linear-chain CRFs have many of the same applications as conceptually simpler hidden Markov models (HMMs), but relax certain assumptions about the input and output sequence distributions. An HMM can loosely be understood as a CRF with very specific feature functions that use constant probabilities to model state transitions and emissions. Conversely, a CRF can loosely be understood as a generalization of an HMM that makes the constant transition probabilities into arbitrary functions that vary across the positions in the sequence of hidden states, depending on the input sequence. Notably, in contrast to HMMs, CRFs can contain any number of feature functions, the feature functions can inspect the entire input sequence X at any point during inference, and the range of the feature functions need not have a probabilistic interpretation.
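The following toy sketch makes the role of the feature functions concrete; the feature functions, labels and weights are invented for illustration. Each feature inspects the whole input sequence x together with a pair of adjacent labels, and a weighted sum of feature values gives the unnormalized score of a labelling:

```python
# Toy feature functions f(i, y_prev, y_cur, x) for a part-of-speech-style
# tagger; the names, labels and weights are all made up for illustration.
def f_capitalized_noun(i, y_prev, y_cur, x):
    return 1.0 if x[i][0].isupper() and y_cur == "NOUN" else 0.0

def f_det_then_noun(i, y_prev, y_cur, x):
    return 1.0 if y_prev == "DET" and y_cur == "NOUN" else 0.0

FEATURES = [f_capitalized_noun, f_det_then_noun]
WEIGHTS = [0.8, 1.5]

def sequence_score(x, y):
    """Unnormalized log-score of labelling y; exponentiating and dividing
    by the partition function Z(x) would give the CRF probability p(y | x)."""
    return sum(
        w * f(i, y[i - 1] if i > 0 else None, y[i], x)
        for i in range(len(x))
        for f, w in zip(FEATURES, WEIGHTS)
    )

print(sequence_score(["The", "Dog", "barks"], ["DET", "NOUN", "VERB"]))  # 2.3
```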


Variants


Higher-order CRFs and semi-Markov CRFs

CRFs can be extended into higher order models by making each Y_i dependent on a fixed number k of previous variables Y_{i-k}, ..., Y_{i-1}. In conventional formulations of higher order CRFs, training and inference are only practical for small values of k (such as ''k'' ≤ 5), since their computational cost increases exponentially with k.

However, another recent advance has managed to ameliorate these issues by leveraging concepts and tools from the field of Bayesian nonparametrics. Specifically, the CRF-infinity approach constitutes a CRF-type model that is capable of learning infinitely-long temporal dynamics in a scalable fashion. This is effected by introducing a novel potential function for CRFs that is based on the Sequence Memoizer (SM), a nonparametric Bayesian model for learning infinitely-long dynamics in sequential observations. To render such a model computationally tractable, CRF-infinity employs a mean-field approximation of the postulated novel potential functions (which are driven by an SM). This allows for devising efficient approximate training and inference algorithms for the model, without undermining its capability to capture and model temporal dependencies of arbitrary length.

There exists another generalization of CRFs, the semi-Markov conditional random field (semi-CRF), which models variable-length ''segmentations'' of the label sequence Y. This provides much of the power of higher-order CRFs to model long-range dependencies of the Y_i, at a reasonable computational cost (a sketch of a semi-CRF segment score appears below).

Finally, large-margin models for structured prediction, such as the structured support vector machine, can be seen as an alternative training procedure to CRFs.
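To make the segmentation idea concrete, here is a minimal sketch of how a semi-CRF scores one candidate segmentation; the segment feature function is invented for illustration, and a real model would combine such scores inside a segment-level dynamic program:

```python
# A segmentation is a list of (start, end, label) triples covering x,
# with `end` exclusive; segment features may inspect the whole span.
def segment_feature(x, start, end, label, prev_label):
    """Toy feature: reward multi-token spans labelled as entities."""
    return float(end - start > 1 and label == "ENTITY")

def semi_crf_score(x, segmentation, weight=1.0):
    """Unnormalized log-score of one variable-length segmentation."""
    score, prev_label = 0.0, None
    for start, end, label in segmentation:
        score += weight * segment_feature(x, start, end, label, prev_label)
        prev_label = label
    return score

print(semi_crf_score(["New", "York", "is", "big"],
                     [(0, 2, "ENTITY"), (2, 4, "OTHER")]))  # 1.0
```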


Latent-dynamic conditional random field

Latent-dynamic conditional random fields (LDCRF) or discriminative probabilistic latent variable models (DPLVM) are a type of CRFs for sequence tagging tasks. They are latent variable models that are trained discriminatively.

In an LDCRF, like in any sequence tagging task, given a sequence of observations x = x_1,\dots,x_n, the main problem the model must solve is how to assign a sequence of labels y = y_1,\dots,y_n from one finite set of labels ''Y''. Instead of directly modeling P(y \mid x) as an ordinary linear-chain CRF would do, a set of latent variables h is "inserted" between x and y using the chain rule of probability:

:P(\mathbf{y} \mid \mathbf{x}) = \sum_{\mathbf{h}} P(\mathbf{y} \mid \mathbf{h}, \mathbf{x}) P(\mathbf{h} \mid \mathbf{x})

This allows capturing latent structure between the observations and labels. While LDCRFs can be trained using quasi-Newton methods, a specialized version of the perceptron algorithm called the latent-variable perceptron has been developed for them as well, based on Collins' structured perceptron algorithm. These models find applications in computer vision, specifically gesture recognition from video streams, and shallow parsing.
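The sum over \mathbf{h} in the chain-rule decomposition above is tractable in the common LDCRF design in which each label owns its own disjoint set of hidden states, so that P(\mathbf{y} \mid \mathbf{h}, \mathbf{x}) is 1 when every h_j is compatible with y_j and 0 otherwise. The sketch below assumes exactly that restriction, with invented array names; it reduces the marginal to forward passes over the hidden chain:

```python
import numpy as np
from scipy.special import logsumexp

def ldcrf_log_prob(unary, trans, y, states_of_label):
    """log P(y | x) for an LDCRF with disjoint per-label hidden-state sets.

    unary: (T, H) log-potentials over all H hidden states (computed from x).
    trans: (H, H) hidden-state transition log-potentials.
    y: label sequence of length T.
    states_of_label: dict mapping each label to its hidden-state indices.
    """
    def restricted_logZ(allowed):
        # Forward recursion confined to the allowed hidden states per step.
        alpha = np.full(unary.shape[1], -np.inf)
        alpha[allowed[0]] = unary[0, allowed[0]]
        for t in range(1, len(allowed)):
            nxt = np.full_like(alpha, -np.inf)
            nxt[allowed[t]] = unary[t, allowed[t]] + logsumexp(
                alpha[:, None] + trans[:, allowed[t]], axis=0)
            alpha = nxt
        return logsumexp(alpha)

    all_states = [np.arange(unary.shape[1])] * len(y)
    compatible = [np.asarray(states_of_label[label]) for label in y]
    # P(y | x) = (sum over compatible h) / (sum over all h), in log space.
    return restricted_logZ(compatible) - restricted_logZ(all_states)
```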


See also

* Hammersley–Clifford theorem
* Maximum entropy Markov model (MEMM)




Further reading

* McCallum, A.: Efficiently inducing features of conditional random fields. In: ''Proc. 19th Conference on Uncertainty in Artificial Intelligence''. (2003)
* Wallach, H.M.: Conditional random fields: An introduction. Technical report MS-CIS-04-21, University of Pennsylvania (2004)
* Sutton, C., McCallum, A.: An Introduction to Conditional Random Fields for Relational Learning. In: ''Introduction to Statistical Relational Learning''. Edited by Lise Getoor and Ben Taskar. MIT Press (2006)
* Klinger, R., Tomanek, K.: Classical Probabilistic Models and Conditional Random Fields. Algorithm Engineering Report TR07-2-013, Department of Computer Science, Dortmund University of Technology, December 2007. ISSN 1864-4503